You are given a set of data on housing sale prices for the last few years in King County (near Seattle) between May 2014 and May 2015. Have a look at the variable definitions on the Kaggle page
Tidy up the data ready for regression:
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
[37m── [1mAttaching packages[22m ──────────────────────────────────────────────────────────── tidyverse 1.2.1 ──[39m
[37m[32m✔[37m [34mggplot2[37m 3.2.1 [32m✔[37m [34mpurrr [37m 0.3.2
[32m✔[37m [34mtibble [37m 2.1.3 [32m✔[37m [34mdplyr [37m 0.8.2
[32m✔[37m [34mtidyr [37m 0.8.3 [32m✔[37m [34mstringr[37m 1.4.0
[32m✔[37m [34mreadr [37m 1.3.1 [32m✔[37m [34mforcats[37m 0.4.0[39m
[37m── [1mConflicts[22m ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[37m [34mdplyr[37m::[32mfilter()[37m masks [34mstats[37m::filter()
[31m✖[37m [34mdplyr[37m::[32mlag()[37m masks [34mstats[37m::lag()[39m
library(readr)
library(lubridate)
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
house <- read_csv("kc_house_data.csv")
Parsed with column specification:
cols(
.default = col_double(),
id = [31mcol_character()[39m,
date = [34mcol_datetime(format = "")[39m
)
See spec(...) for full column specifications.
summary(house)
id date price bedrooms bathrooms
Length:21613 Min. :2014-05-02 00:00:00 Min. : 75000 Min. : 0.000 Min. :0.000
Class :character 1st Qu.:2014-07-22 00:00:00 1st Qu.: 321950 1st Qu.: 3.000 1st Qu.:1.750
Mode :character Median :2014-10-16 00:00:00 Median : 450000 Median : 3.000 Median :2.250
Mean :2014-10-29 04:38:01 Mean : 540088 Mean : 3.371 Mean :2.115
3rd Qu.:2015-02-17 00:00:00 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500
Max. :2015-05-27 00:00:00 Max. :7700000 Max. :33.000 Max. :8.000
sqft_living sqft_lot floors waterfront view
Min. : 290 Min. : 520 Min. :1.000 Min. :0.000000 Min. :0.0000
1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000
Median : 1910 Median : 7618 Median :1.500 Median :0.000000 Median :0.0000
Mean : 2080 Mean : 15107 Mean :1.494 Mean :0.007542 Mean :0.2343
3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000
Max. :13540 Max. :1651359 Max. :3.500 Max. :1.000000 Max. :4.0000
condition grade sqft_above sqft_basement yr_built yr_renovated
Min. :1.000 Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
Median :3.000 Median : 7.000 Median :1560 Median : 0.0 Median :1975 Median : 0.0
Mean :3.409 Mean : 7.657 Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
Max. :5.000 Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
zipcode lat long sqft_living15 sqft_lot15
Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399 Min. : 651
1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
Median :98065 Median :47.57 Median :-122.2 Median :1840 Median : 7620
Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987 Mean : 12768
3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10083
Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210 Max. :871200
You might like to think about removing some or all of date, id, sqft_living15, sqft_lot15 and zipcode (lat and long provide a better measure of location in any event).
house_trim <- house %>%
select(-id) %>%
select(-date) %>%
select(-zipcode) %>%
select(-sqft_living15) %>%
select(-sqft_lot15)
Have a think about how to treat waterfront. Should we convert its type? NO
unique(house_trim$waterfront)
[1] 0 1
We converted yr_renovated into a renovated logical variable, indicating whether the property had ever been renovated. You may wish to do the same.
house_boolean <- house_trim %>%
mutate(renovated = as.numeric(yr_renovated != 0)) %>%
select(-yr_renovated)
colnames(house_boolean)
[1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot" "floors"
[7] "waterfront" "view" "condition" "grade" "sqft_above" "sqft_basement"
[13] "yr_built" "lat" "long" "renovated"
Have a think about how to treat condition and grade? Are they interval or categorical ordinal data types?
unique(house_boolean$condition)
[1] 3 5 4 1 2
unique(house_boolean$grade)
[1] 7 6 8 11 9 5 10 12 4 3 13 1
Check for aliased variables using the alias() function (this takes in a formula object and a data set).
alias(price ~ .,
data = house_boolean)
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition
sqft_basement 0 0 0 1 0 0 0 0 0
grade sqft_above yr_built lat long renovated
sqft_basement 0 -1 0 0 0 0
square foot living and above highly correlated to sqft_basement - let’s remove them
house_bool <- house_boolean %>%
select(-sqft_basement) %>%
select(-sqft_above)
Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with coefficient), testing the regression diagnostics as you go splitting datasets into numeric and non-numeric columns might help ggpairs() run in manageable time, although you will need to add either a price or resid column to the non-numeric dataframe in order to see its correlations with the non-numeric predictors.
library(modelr)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Attaching package: ‘GGally’
The following object is masked from ‘package:dplyr’:
nasa
house_num <- subset(house_bool, select = c("price", "bedrooms", "bathrooms",
"sqft_living", "floors",
"lat", "long"))
house_cat <- subset(house_bool, select = c("waterfront", "view", "condition",
"grade", "renovated", "price"))
ggpairs(house_num)
ggpairs(house_cat)
1st model
mod1a_sqft <- lm(price ~ sqft_living,
data = house_bool)
mod1b_grade <- lm(price ~ grade,
data = house_bool)
summary(mod1a_sqft)
Call:
lm(formula = price ~ sqft_living, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1476062 -147486 -24043 106182 4362067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43580.743 4402.690 -9.899 <2e-16 ***
sqft_living 280.624 1.936 144.920 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261500 on 21611 degrees of freedom
Multiple R-squared: 0.4929, Adjusted R-squared: 0.4928
F-statistic: 2.1e+04 on 1 and 21611 DF, p-value: < 2.2e-16
plot(mod1a_sqft)
summary(mod1b_grade)
Call:
lm(formula = price ~ grade, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-816988 -151958 -36158 97842 6046097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1056045 12256 -86.17 <2e-16 ***
grade 208458 1582 131.76 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 273400 on 21611 degrees of freedom
Multiple R-squared: 0.4455, Adjusted R-squared: 0.4454
F-statistic: 1.736e+04 on 1 and 21611 DF, p-value: < 2.2e-16
plot(mod1b_grade)
see if log fixes it (one by one and then both):
plot_1c_log <- mod1a_sqft <- lm(log(price) ~ log(sqft_living),
data = house_bool)
summary(plot_1c_log)
Call:
lm(formula = log(price) ~ log(sqft_living), data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1.10511 -0.29300 0.01262 0.25701 1.33011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.729916 0.047062 143.0 <2e-16 ***
log(sqft_living) 0.836771 0.006223 134.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3886 on 21611 degrees of freedom
Multiple R-squared: 0.4555, Adjusted R-squared: 0.4555
F-statistic: 1.808e+04 on 1 and 21611 DF, p-value: < 2.2e-16
not really…
cor(house_bool$sqft_living, house_bool$grade)
[1] 0.7627045
high correlation between the sqft_living and the grade. Grade will not be added to the model
Let’s check with anova
mod1a_sqft_grade <- lm(price ~ sqft_living + grade,
data = house_bool)
anova(mod1a_sqft, mod1a_sqft_grade)
models with response ‘"price"’ removed because response differs from model 1
Analysis of Variance Table
Response: log(price)
Df Sum Sq Mean Sq F value Pr(>F)
log(sqft_living) 1 2730.8 2730.81 18079 < 2.2e-16 ***
Residuals 21611 3264.3 0.15
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We could use it
summary(mod1a_sqft_grade)
Call:
lm(formula = price ~ sqft_living + grade, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1065457 -138304 -25043 100447 4794633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.981e+05 1.330e+04 -44.98 <2e-16 ***
sqft_living 1.844e+02 2.869e+00 64.29 <2e-16 ***
grade 9.855e+04 2.241e+03 43.97 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 250500 on 21610 degrees of freedom
Multiple R-squared: 0.5345, Adjusted R-squared: 0.5345
F-statistic: 1.241e+04 on 2 and 21610 DF, p-value: < 2.2e-16
but not much improvement in the R-squared
house_num_2 <- house_num %>%
add_residuals(mod1a_sqft) %>%
select(-c("price", "sqft_living"))
ggpairs(house_num_2)
house_cat_2 <- house_bool %>%
add_residuals(mod1a_sqft) %>%
select(-c("price", "bedrooms", "bathrooms", "sqft_living",
"floors", "lat", "long", "grade"))
ggpairs(house_cat_2)
NA
Latitude and view have the best correlation with the residuals:
mod2_sqft_lat <- lm(price ~ sqft_living + lat,
data = house_bool)
mod2_sqft_view <- lm(price ~ sqft_living + view,
data = house_bool)
summary(mod2_sqft_lat)
Call:
lm(formula = price ~ sqft_living + lat, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1487994 -125643 -20309 84613 4368717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.416e+07 5.653e+05 -60.44 <2e-16 ***
sqft_living 2.749e+02 1.794e+00 153.27 <2e-16 ***
lat 7.177e+05 1.189e+04 60.36 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241900 on 21610 degrees of freedom
Multiple R-squared: 0.566, Adjusted R-squared: 0.566
F-statistic: 1.409e+04 on 2 and 21610 DF, p-value: < 2.2e-16
plot(mod2_sqft_lat)
summary(mod2_sqft_view)
Call:
lm(formula = price ~ sqft_living + view, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1583570 -139453 -19322 104245 4321083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16853.463 4257.279 -3.959 7.56e-05 ***
sqft_living 256.176 1.934 132.485 < 2e-16 ***
view 102951.162 2317.467 44.424 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 250300 on 21610 degrees of freedom
Multiple R-squared: 0.5353, Adjusted R-squared: 0.5352
F-statistic: 1.245e+04 on 2 and 21610 DF, p-value: < 2.2e-16
plot(mod2_sqft_view)
Looks like latitude is the best Is there a correlation between latitude and view?
cor(house_bool$lat, house_bool$view)
[1] 0.006156732
not really - let see if we can improve the model with the view:
mod3_sqft_lat_view <- lm(price ~ sqft_living + lat + view,
data = house_bool)
summary(mod3_sqft_lat_view)
Call:
lm(formula = price ~ sqft_living + lat + view, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1596893 -115223 -14728 82111 4327282
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.438e+07 5.363e+05 -64.11 <2e-16 ***
sqft_living 2.502e+02 1.775e+00 140.93 <2e-16 ***
lat 7.228e+05 1.128e+04 64.08 <2e-16 ***
view 1.042e+05 2.125e+03 49.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 229400 on 21609 degrees of freedom
Multiple R-squared: 0.6095, Adjusted R-squared: 0.6094
F-statistic: 1.124e+04 on 3 and 21609 DF, p-value: < 2.2e-16
Hurray!! Let see what else to use
house_3 <- house_bool %>%
add_residuals(mod3_sqft_lat_view) %>%
select(-c("price", "grade", "lat", "view"))
ggpairs(house_3)
Waterfront looks good so let’s go with that
mod_4 <- lm(price ~ sqft_living + lat + view + waterfront,
data = house_bool)
summary(mod_4)
Call:
lm(formula = price ~ sqft_living + lat + view + waterfront, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1497101 -113377 -14143 82029 4400900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.467e+07 5.249e+05 -66.05 <2e-16 ***
sqft_living 2.507e+02 1.737e+00 144.35 <2e-16 ***
lat 7.288e+05 1.104e+04 66.02 <2e-16 ***
view 7.691e+04 2.258e+03 34.06 <2e-16 ***
waterfront 5.969e+05 1.928e+04 30.95 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 224500 on 21608 degrees of freedom
Multiple R-squared: 0.6261, Adjusted R-squared: 0.626
F-statistic: 9045 on 4 and 21608 DF, p-value: < 2.2e-16
not much added value - let’s try with the next best :year_build
mod_4b <- lm(price ~ sqft_living + lat + view + yr_built,
data = house_bool)
summary(mod_4b)
Call:
lm(formula = price ~ sqft_living + lat + view + yr_built, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1708731 -111155 -10443 84211 4129084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.941e+07 5.681e+05 -51.77 <2e-16 ***
sqft_living 2.664e+02 1.878e+00 141.87 <2e-16 ***
lat 6.744e+05 1.131e+04 59.62 <2e-16 ***
view 9.589e+04 2.125e+03 45.13 <2e-16 ***
yr_built -1.370e+03 5.692e+01 -24.06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 226400 on 21608 degrees of freedom
Multiple R-squared: 0.6197, Adjusted R-squared: 0.6196
F-statistic: 8802 on 4 and 21608 DF, p-value: < 2.2e-16
nope! mod_4 is still the best
2 Extensions Consider possible interactions between your four main effect predictors and test their effect upon r2. Choose your best candidate interaction and visualise its effect. mod_4 <- lm(price ~ sqft_living + lat + view + waterfront,
data = house_bool)
mod5_a <- lm(price ~ sqft_living + lat + view + waterfront + sqft_living:waterfront,
data = house_bool)
mod5_b <- lm(price ~ sqft_living + lat + view + waterfront + waterfront:view,
data = house_bool)
summary(mod5_a)
Call:
lm(formula = price ~ sqft_living + lat + view + waterfront +
sqft_living:waterfront, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1418430 -111833 -15343 78675 4479310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.431e+07 5.130e+05 -66.89 <2e-16 ***
sqft_living 2.422e+02 1.718e+00 140.98 <2e-16 ***
lat 7.217e+05 1.079e+04 66.89 <2e-16 ***
view 7.944e+04 2.208e+03 35.98 <2e-16 ***
waterfront -5.064e+05 3.930e+04 -12.89 <2e-16 ***
sqft_living:waterfront 3.477e+02 1.087e+01 31.99 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 219400 on 21607 degrees of freedom
Multiple R-squared: 0.643, Adjusted R-squared: 0.6429
F-statistic: 7783 on 5 and 21607 DF, p-value: < 2.2e-16
summary(mod5_b)
Call:
lm(formula = price ~ sqft_living + lat + view + waterfront +
waterfront:view, data = house_bool)
Residuals:
Min 1Q Median 3Q Max
-1496810 -113388 -14067 81993 4401105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.466e+07 5.249e+05 -66.038 < 2e-16 ***
sqft_living 2.508e+02 1.737e+00 144.345 < 2e-16 ***
lat 7.287e+05 1.104e+04 66.009 < 2e-16 ***
view 7.682e+04 2.263e+03 33.940 < 2e-16 ***
waterfront 5.278e+05 1.196e+05 4.413 1.02e-05 ***
view:waterfront 1.843e+04 3.147e+04 0.586 0.558
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 224500 on 21607 degrees of freedom
Multiple R-squared: 0.6261, Adjusted R-squared: 0.626
F-statistic: 7236 on 5 and 21607 DF, p-value: < 2.2e-16
house_bool %>%
ggplot(aes(x = sqft_living, y = price, colour = waterfront)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Calculate the relative importance of predictors from your best 4-predictor model (i.e. the model without an interaction). Which predictor affects price most strongly?
library(relaimpo)
calc.relimp(mod_4, type = "lmg", rela = TRUE)
Response variable: price
Total response variance: 134782378397
Analysis based on 21613 observations
4 Regressors:
sqft_living lat view waterfront
Proportion of variance explained by model: 62.61%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
sqft_living 0.67161080
lat 0.13512450
view 0.13235564
waterfront 0.06090906
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs
sqft_living 280.6236 267.8727 257.8186 250.7462
lat 813411.5832 782766.2372 754623.2513 728807.2243
view 190335.2479 152740.5976 114877.1645 76912.3892
waterfront 1130312.4247 839722.1754 664548.3041 596863.2631
sqft_living affects the price most strongly